General introduction: Basic Statistics in R

On Day 2, we will focus on learning basic statistics and calculating them in R. The learning objectives for today are:

We’ll be using datasets that come with the R distribution. They are immediately available when you load R. Before we begin, I want to give a hat-tip to Jukka-Pekka Verta who first developed this course.

Exercise: Confirm that you have access to the iris and Puromycin datasets.

Variable and analysis types

Dependent and independent variables

“Dependent” variables are the outcomes of your experiments (i.e. “the thing you are working on”), and are also referred to as “response” variable. The “independent” variables are those you manipulate (i.e. “the thing that explains what you are working on”), these are also referred to as “predictor” or “explanatory” variables.

Continuous vs. discrete variables

  • “Continuous” variables are those that, at least theoretically, can be measured with arbitrary precision. Height, weight and temperature are continuous variables.
  • “Discrete” variables cannot be measured with arbitrary precision because the data takes distinct values. Gender is a discrete variable. Discrete data can be further classified into: 1. “Categorical” data, which is often qualitative categories, with no relationship between the categories. This is also called “nominal”. Gender is a discrete, nominal variable. 2. “Ordinal” data, in which discrete values have an intrinsic ordering. An example of an ordinal variable is if you perform an experiment with “low”, “medium”, and “high” treatment levels.

Variables vs analysis type

Use the table below to determine the type of analysis that is appropriate for each types of variables (table not exclusive).

Predictor variable Method
Continuous Regression
Categorical Analysis of variance (ANOVA), Regression
Continuous & categorical Analysis of covariance (ANCOVA), Regression
Response variable Method
Continuous Regression, ANOVA or ANCOVA
Proportion Logistic regression
Count Log-linear models

Summary statistics and data exploration

Once you have data for an experiment you will want to look at multivariate summary statistics and visualise your data. R has several ways of doing this easily. This is one of the most important steps in your analysis! Many people jump into an analysis without looking at their data. Looking at summary statistics and more importantly plots can alert you to non-normality, missing data, improper data import or conversion, etc.

Summary statistics and plotting data

Your analysis should always start with visualising your data and checking summary statistics. We will use the “iris” dataset for this exercise.

The most basic thing to do is to start with:

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

The R function aggregate allows you to summarise your data with a specific function you decide.

aggregate(iris[, 1:4], by = list(Species = iris$Species), FUN = mean)
##      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     setosa        5.006       3.428        1.462       0.246
## 2 versicolor        5.936       2.770        4.260       1.326
## 3  virginica        6.588       2.974        5.552       2.026
aggregate(iris[, 1:4], by = list(Species = iris$Species), FUN = sd)
##      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     setosa    0.3524897   0.3790644    0.1736640   0.1053856
## 2 versicolor    0.5161711   0.3137983    0.4699110   0.1977527
## 3  virginica    0.6358796   0.3224966    0.5518947   0.2746501

Exercise: Make sure you understand what aggregate is doing. Use ?aggregate to print out the help document.

You also want to inspect your data with plots. The command we start with is called, unsurprisingly, plot().

plot(iris$Sepal.Width)

The previous line applies plot to the whole data, which is obviously not what you want. You want the sepal width by species. The plot parameter par(mfrow=c(x,x))) allows you to define an arbitrary number of plots you want to display in one window. For example, c(1,3) means that we want one row and three columns of plots.

par(mfrow = c(1, 3))
plot(iris$Sepal.Width[which(iris$Species == "setosa")])
plot(iris$Sepal.Width[which(iris$Species == "versicolor")])
plot(iris$Sepal.Width[which(iris$Species == "virginica")])

Exercise: Annotate the x- and y-axes of your plot correctly. Use ?plot to find the options you need to specify.

It’s often helpful to inspect a boxplot instead of the raw data by index. Boxplots allow you to squeeze more data into a single plot. The tilde command ~ reads “as a function of”. In the following plot we want sepal length to be plotted as a function of species.

par(mfrow = c(1, 2))
boxplot(iris$Sepal.Length ~ iris$Species, main = "Sepal length", 
    las = 2)
boxplot(iris$Sepal.Width ~ iris$Species, main = "Sepal width", 
    las = 2)

Using the basic plot command often means you need to do a lot of typing and that’s where the package ggplot2 comes in handy. Start by melting the dataframe to just three columns with melt from the package reshape.

library(reshape)
irisMelt = melt(iris)
## Using Species as id variables
head(irisMelt)
##   Species     variable value
## 1  setosa Sepal.Length   5.1
## 2  setosa Sepal.Length   4.9
## 3  setosa Sepal.Length   4.7
## 4  setosa Sepal.Length   4.6
## 5  setosa Sepal.Length   5.0
## 6  setosa Sepal.Length   5.4

Exercise: Make sure you understand what melt is doing to your data.

Why do we manipulate our data with melt? Using melt and ggplot allows you to “wrap” your variables into categories and plot them all together in aestethically pleasing (and publication-quality) manner. Ggplot works a bit differently from most R functions. Think of ggplot as a figure where you add layers upon layers of different data. This takes some getting used to, but in the end makes the construction of complex figures much more straight forward. Your ggplot figure code is also easy to read.

library(ggplot2)
g = ggplot(irisMelt)  # the first line initiates ggplot with your data frame
g = g + geom_boxplot(aes(x = Species, y = value))  # second line defines that we want a boxplot, and which columns of the data frame to plot
g = g + facet_wrap(~variable)  # third line defines that we want to separate each measurement into its own bowplot
g  # last line executes the ggplot command

A cool feature of constructing ggplots element-by-element like this is that you can add more things to existing ggplots. Let’s add all the raw data points on top of the boxplots.

g = g + geom_jitter(aes(x = Species, y = value), cex = 0.5)  # geom_jitter adds the datapoint and cex=.5 shrinks the datapoints by 50% relative to default
g  # and then plot the ggplot again

Variance

The level of variance in your data is so important that it deserves a special section in this work-though exercise. You calculate the variance for two main reasons:

Let’s start by looking at the variation of sepal width in one of the species.

par(mfrow = c(1, 1))
plot(iris$Sepal.Width[which(iris$Species == "setosa")])

A fundamental measure of variation (scatter) in data are the RESIDUALS. Residuals are the deviation of measured values from the mean value, this is best explained by a plot.

plot(iris$Sepal.Width[which(iris$Species == "setosa")], pch = 20)
abline(h = mean(iris$Sepal.Width[which(iris$Species == "setosa")]), 
    col = "green")
for (i in 1:length(which(iris$Species == "setosa"))) {
    meanWidth = mean(iris$Sepal.Width[which(iris$Species == "setosa")])
    observedWidth = iris$Sepal.Width[which(iris$Species == "setosa")][i]
    lines(c(i, i), c(meanWidth, observedWidth), col = "red")
}

An important property of residuals is that they all add up to zero, irrespective of the level of variability in the data. This means that we can’t use the residuals as such to measure variability in the data. Instead, we can get rid of the minus signs in residuals by taking the square of them before adding up. This is the SUM OF SQUARES (SS), perhaps the most central quantity in statistics (Crawley 2015). Lets define a function to calculate the SS.

SS = function(x) {
    observedValues = x  # this is just to rename the 'x' so that the calculation is clearer
    meanValue = mean(observedValues)
    squareResids = (observedValues - meanValue)^2
    return(sum(squareResids))
}

SS(iris$Sepal.Width[which(iris$Species == "setosa")])
## [1] 7.0408

There’s also a shortcut way to calculate the SS, given in the function below. This is the form that is often used in text books when talking about correlation and regression (called the SSY).

SSY = function(x) {
    observedValues = x
    return(sum(observedValues^2) - sum(observedValues)^2/length(observedValues))
}

SSY(iris$Sepal.Width[which(iris$Species == "setosa")])
## [1] 7.0408

Obviously the SS increases with more data, so we need to normalize the SS to the sample size. This still isn’t good enough estimate of the variation in the data because we have we have already estimated one parameter from the data: the mean. This is why we use the DEGREES of FREEDOM (DoF) instead of the sample size to normalize the SS. The DoF is the calculated as the sample size n minus the number of parameters estimated. In our case we estimated one parameter (the mean) so DoF = n-1. Dividing the SS by DoF gives the VARIANCE of the data. Let’s write a function to calculate the variance.

variance = function(x) {
    sumOfSquares = SS(x)
    DoF = length(x) - 1
    return(sumOfSquares/DoF)
}

variance(iris$Sepal.Width[which(iris$Species == "setosa")])
## [1] 0.1436898

Exercise: Check if our calculation of the variance was correct by comparing to the function var (which does the same thing we just worked through manually).

OK, so now we have an intuition of what the concept of variance means. Here it’s important to pause for a moment and recognise that what we have just calculated for the iris dataset is called the sample variance. So, what have we done? We have taken a sample of Iris setosa plants, measured their sepal widths and calculated the variance. Have we measured the sepal widths of all Iris setosa plants there exists? Obviously not, we took a sample. Therefore our measure of variance is a parameter that we estimated from the data. Now we come to two subtly different concepts that we often confuse with one another:

Both of these deal with estimating the variability or “spread” of your data around some mean value. We can establish the true variance of a variable only in the case when we measure every single case in a population (e.g. measuring the length of every single stickleback in the Neckar). Its obvious that in practice, we will never know the true variance of a variable. We therefore use the variance observed in a sample (the sample variance) as an approximation of the true variance. Because variance is an estimated parameter, it makes sense that its precision depends on the sample size.

Standard error (of the mean) and Confidence Intervals

Given that we are practically never able to measure all individuals in a population, our understanding of population parameters like the true mean and true variance will rely on sampling individuals from a population and calculating the sample mean and sample variance. In other words, we estimate the population parameters from the sample distribution. As in any case of sampling, the sample distributions are going to be inaccurate. It follows that our estimate for the true population mean or variance, the sample mean and the sample variance, are going to be inaccurate also. Ideally, we would like to quantify this inaccuracy. We use the concepts of STANDARD ERROR and CONFIDENCE INTERVALS to quantify the inaccuracy due to estimation of population parameters.

We can make two general predictions of the behaviour of the inaccuracy in parameter estimates.

Consider for example the sample mean. This is an estimated parameter that approximates the true population mean. To quantify the unreliability of our calculation of the mean we use the following formula:

The SE is always expressed in the same units as the measured variable. This means that we can represent the mean and it’s SE as a barplot with whiskers, like so:

Tip: The formal way of writing a confidence interval for a normally distributed variable is the following.

Where c is the desired level of confidence, e.g. 0.05 (5%), and Norm is the quantiles of the standard normal distribution (also called a z-score). Typical values of z for two-sided confidence intervals are given in the table below. You can also visit the wikipedia page https://en.wikipedia.org/wiki/Standard_score. Remember that sqrt(variance/sample size) is the standard error, so what we are essentially doing is scaling the standard error to achieve the desired level of confidence.

c Norm(alpha=1-c/2)
99% 2.576
98% 2.326
95% 1.96
90% 1.645
68% 1

The null hypothesis and the p-value

In formal hypothesis testing, we start with a well-thought null hypothesis, which is assumed to be true, and an alternative hypothesis, for which we are attempting to validate using our data. The essential point is that a null hypothesis is falsifiable. We reject the null hypothesis (which in practice is often “nothing varies” or “there is no difference between A and B”) when our data indicates that the null hypothesis is sufficiently unlikely. We base our decision each time on a value of a test statistics. If we observe a value of a test statistics that is very unlikely to be observed when the null hypothesis is true, we reject the null. In precise terms, a p-value is an estimate of the probability of observing a value of a test statistics, or a value more extreme than this, when the null hypothesis is true. You should memorise the last sentence. It is also important to memorise the following two things:

Whenever you encounter a p-value, you should know the underlying null expectation and the test statistics used to produce the p-value. This will clarify your logic and help you recognize errors in analysis as well as limits of the statistical tests.

Comparison of means

t-test

The t-test is used to compare two means. The idea is simple: given what we know about the variances within two groups of samples, how likely is it that the samples were drawn from two populations with the same mean?

One important assumpion of tests based on comparing two or more means is that the variance is equal in all compared samples. This is called CONSTANCY of VARIANCE. Why is this important? Because two samples can have the same mean but different variance. If you compare just the means you may conclude that the samples are identical. However, variance of the data is equally important in most contexts, so this conclusion would probably be wrong. In short: when the variances are different, we shouldn’t make inferences by comparing the means because this would be a fundamentally wrong approach (Crawley 2015). Therefore, if you are comparing means you need to make sure that the variances are identical.

Now let’s just quickly compare the variance of sepal widths in the three iris species.

var(iris$Sepal.Width[which(iris$Species == "setosa")])
## [1] 0.1436898
var(iris$Sepal.Width[which(iris$Species == "versicolor")])
## [1] 0.09846939
var(iris$Sepal.Width[which(iris$Species == "virginica")])
## [1] 0.1040041

We can test for the equality of the variances so:

var.test(iris$Sepal.Width[which(iris$Species == "setosa")], iris$Sepal.Width[which(iris$Species == 
    "versicolor")])
## 
##  F test to compare two variances
## 
## data:  iris$Sepal.Width[which(iris$Species == "setosa")] and iris$Sepal.Width[which(iris$Species == "versicolor")]
## F = 1.4592, num df = 49, denom df = 49, p-value = 0.1895
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.828080 2.571444
## sample estimates:
## ratio of variances 
##           1.459233
var.test(iris$Sepal.Width[which(iris$Species == "setosa")], iris$Sepal.Width[which(iris$Species == 
    "virginica")])
## 
##  F test to compare two variances
## 
## data:  iris$Sepal.Width[which(iris$Species == "setosa")] and iris$Sepal.Width[which(iris$Species == "virginica")]
## F = 1.3816, num df = 49, denom df = 49, p-value = 0.2614
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.7840128 2.4346017
## sample estimates:
## ratio of variances 
##           1.381578
var.test(iris$Sepal.Width[which(iris$Species == "virginica")], 
    iris$Sepal.Width[which(iris$Species == "versicolor")])
## 
##  F test to compare two variances
## 
## data:  iris$Sepal.Width[which(iris$Species == "virginica")] and iris$Sepal.Width[which(iris$Species == "versicolor")]
## F = 1.0562, num df = 49, denom df = 49, p-value = 0.849
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.5993724 1.8612363
## sample estimates:
## ratio of variances 
##           1.056207

Equally important is visualising the data with box plots.

boxplot(iris$Sepal.Width[which(iris$Species == "virginica")], 
    iris$Sepal.Width[which(iris$Species == "versicolor")])

In conclusion: all looks good for the iris dataset and we can continue with the exercise.

Now we have established that the variances in sepal width are similar in different species; comparison of their mean sepal widths is therefore justified. In practise our data are rarely perfect and formal test of equality of variance often reject the null hypothesis that variances are equal (even when the variances are only slightly different). We therefore need to consider our motivations for the test and think hard whether comparing the means is appropriate. Often we still want to do the comparison. It is therefore quite convenient that the default t-test in R performs a Welch Approximate t-test, which is robust against deviations from equity of variances. The actual test for comparing two means is straighforward to execute:

t.test(iris$Sepal.Width[which(iris$Species == "versicolor")], 
    iris$Sepal.Width[which(iris$Species == "virginica")])
## 
##  Welch Two Sample t-test
## 
## data:  iris$Sepal.Width[which(iris$Species == "versicolor")] and iris$Sepal.Width[which(iris$Species == "virginica")]
## t = -3.2058, df = 97.927, p-value = 0.001819
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.33028364 -0.07771636
## sample estimates:
## mean of x mean of y 
##     2.770     2.974

Remember that even the Welch test assumes that the two compared samples are independent and their errors are normally distributed. If we are certain that the variances are equal we can call the classical Student’s t-test so:

t.test(iris$Sepal.Width[which(iris$Species == "versicolor")], 
    iris$Sepal.Width[which(iris$Species == "virginica")], var.equal = T)
## 
##  Two Sample t-test
## 
## data:  iris$Sepal.Width[which(iris$Species == "versicolor")] and iris$Sepal.Width[which(iris$Species == "virginica")]
## t = -3.2058, df = 98, p-value = 0.001819
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.33028246 -0.07771754
## sample estimates:
## mean of x mean of y 
##     2.770     2.974

As you can see from the output it gives an (nearly) equal result to the Welch test.

Exercise: Spot the differences in the outputs of Student’s and Welch’s t-tests.

Paired t-test

In the case where the compared variables are drawn from paired samples (i.e. samples that are clearly related to each other in one way of the other), we use the paired t-test (using the paired test is when possible is always recommended because it is much more powerful).

For now we will compare sepal width and sepal lengths from the same samples.

boxplot(iris$Sepal.Width[which(iris$Species == "versicolor")], 
    iris$Sepal.Length[which(iris$Species == "versicolor")])

t.test(iris$Sepal.Width[which(iris$Species == "versicolor")], 
    iris$Sepal.Length[which(iris$Species == "versicolor")], paired = T)
## 
##  Paired t-test
## 
## data:  iris$Sepal.Width[which(iris$Species == "versicolor")] and iris$Sepal.Length[which(iris$Species == "versicolor")]
## t = -50.757, df = 49, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.291348 -3.040652
## sample estimates:
## mean of the differences 
##                  -3.166

Non-parametric t-test

In case where the errors are non-normal we can use the non-parametric Wilcoxon rank sum test.

wilcox.test(iris$Sepal.Width[which(iris$Species == "versicolor")], 
    iris$Sepal.Width[which(iris$Species == "virginica")])
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  iris$Sepal.Width[which(iris$Species == "versicolor")] and iris$Sepal.Width[which(iris$Species == "virginica")]
## W = 841, p-value = 0.004572
## alternative hypothesis: true location shift is not equal to 0

Remember that with normal data, the non-parametric test is ~95% as powerful. On the other hand, when data is non-normal (for example because of strong presence of outliers) the non-parametric test is much more powerful than the parametric test.

ANOVA

You use ANOVA when your explanatory variable(s) is categorical and you’re interested in the difference in mean values between categories. You can think of ANOVA as a case of t-test when you have more than 2 categories.

Let’s start by looking at the iris data again.

library(reshape)

library(ggplot2)

irisMelt = melt(iris)
## Using Species as id variables
head(irisMelt)
##   Species     variable value
## 1  setosa Sepal.Length   5.1
## 2  setosa Sepal.Length   4.9
## 3  setosa Sepal.Length   4.7
## 4  setosa Sepal.Length   4.6
## 5  setosa Sepal.Length   5.0
## 6  setosa Sepal.Length   5.4
g = ggplot(irisMelt)
g = g + geom_boxplot(aes(x = Species, y = value))
g = g + facet_wrap(~variable)
g

The fundamental aspect of ANOVA is that you use the variance within versus between classes to compare the means. If that sounds counterintuitive the following plot should make things clearer. Let’s look at sepal length in the whole data versus the mean of sepal lengths.

g = ggplot(iris)
g = g + geom_point(aes(x = 1:length(Sepal.Length), y = Sepal.Length))
g = g + geom_hline(aes(yintercept = mean(Sepal.Length)), colour = "green")
g

Now lets add the residuals. I’ve broken down the command to multiple lines so that its easier to see whats going on.

g = g + geom_segment(aes(x = 1:length(Sepal.Length), xend = 1:length(Sepal.Length), 
    y = Sepal.Length, yend = rep(mean(Sepal.Length), times = length(Sepal.Length))), 
    colour = "red")
g

The essence of ANOVA is that we compare the residuals relative to the category means and the residuals relative to the global mean of the data set. Lets first plot the category (species) means to the plot instead of the global mean.

p = ggplot(iris)
p = p + geom_point(aes(x = 1:length(Sepal.Length), y = Sepal.Length))
p = p + geom_hline(aes(yintercept = mean(Sepal.Length[which(Species == 
    "setosa")])), colour = "green")
p = p + geom_hline(aes(yintercept = mean(Sepal.Length[which(Species == 
    "versicolor")])), colour = "blue")
p = p + geom_hline(aes(yintercept = mean(Sepal.Length[which(Species == 
    "virginica")])), colour = "purple")
p

Now lets add the residuals again one species at a time.

p = p + geom_segment(data=iris[which(iris$Species=='setosa'),], # we want to plot only the setosa residuals so we specify that to geom_segment
  aes(
  x=which(iris$Species=='setosa'), # the x-cordinates of the setosa data points on our plot 
  xend=which(iris$Species=='setosa'), # the x-cordinates of the setosa data points on our plot
  y=Sepal.Length, # y-start of our lines are in data points of sepal lengths in setosa 
  yend=rep(mean(Sepal.Length),times=length(Sepal.Length))), # y-end of our lines are always at the mean sepal length in setosa
  colour='green',alpha=.5)
p # there's the residuals for setosa

p = p + geom_segment(data=iris[which(iris$Species=='versicolor'),],
  aes(
  x=which(iris$Species=='versicolor'),
  xend=which(iris$Species=='versicolor'),
  y=Sepal.Length[which(Species=='versicolor')],
  yend=rep(mean(Sepal.Length[which(Species=='versicolor')]),times=length(which(Species=='versicolor')))),
  colour='blue',alpha=.5)
p # residuals for versicolor

p = p + geom_segment(data=iris[which(iris$Species=='virginica'),],
  aes(
  x=which(iris$Species=='virginica'),
  xend=which(iris$Species=='virginica'),
  y=Sepal.Length,
  yend=rep(mean(Sepal.Length),times=length(Sepal.Length))),
  colour='purple',alpha=.5)
p # residuals for virginica

Here’s the catch – if mean sepal lengths in the three species were the same, how would the residual lines we just draw (green, blue and purple) would compare to the residual lines we draw previously to the global mean (red)? The answer is: if the species means were in the same place as the global means, the group residuals would be the same length as the global residuals. This is in essence what we are testing when we do ANOVA.

Next comes a tiny bit of R code tips. You migth have noticed that I names the first plot as “g” and the second as “p” – there’s a reason for that. We can plot multiple ggplots in the same window by using the package grid. Let’s compare the two plots side-by-side.

library(grid)
pushViewport(viewport(layout = grid.layout(1, 2)))
print(g, vp = viewport(layout.pos.row = 1, layout.pos.col = 1))
print(p, vp = viewport(layout.pos.row = 1, layout.pos.col = 2))

Performing ANOVA

Ok, the last section should have built an intuitive understanding of the way how ANOVA works.

Lets move on to do the actual test.

summary(aov(iris$Sepal.Length ~ iris$Species))
##               Df Sum Sq Mean Sq F value Pr(>F)    
## iris$Species   2  63.21  31.606   119.3 <2e-16 ***
## Residuals    147  38.96   0.265                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The output reads:

  • Df = Degrees of freedom
  • Sum Sq = residuals
  • Mean Sq = variance
  • F value = variance(treatment) / variance(error) == explained variance divided by unexplained variance == signal-to-noise ratio
  • Pr(>F) = p-value associated with the F-value

Let’s visualise the means like we were producing a figure for a journal submission.

g = ggplot(iris)
g = g + geom_bar(aes(y = Sepal.Length, x = Species), stat = "identity")
g

Remember that our estimate of the mean is drawn from a sampling population, therefore, it will contain some level of uncertainty. Usually with barplot, if we’re plotting a mean, we want to represent the parameter uncertainty in our plot. Most usual case is to plot the SE of the mean as whiskers. You can also use the 95% Confidence Interval if you will. Let’s plot the SE for now.

First we need to calculate the SE, remember that to do this we need the variance and the sample size.

variances = aggregate(iris$Sepal.Length, by = list(Species = iris$Species), 
    FUN = var)
sampleSizes = aggregate(iris$Species, by = list(Species = iris$Species), 
    FUN = length)

se = sqrt(variances$x/sampleSizes$x)
means = aggregate(iris[, 1:4], by = list(Species = iris$Species), 
    FUN = mean)$Sepal.Length
anovaPlot = data.frame(means = means, se = se, species = levels(iris$Species))

Now we’re ready to include the error bar in the plot.

g = ggplot(anovaPlot)
g = g + geom_bar(aes(y = means, x = species), stat = "identity")
g = g + geom_errorbar(aes(ymin = means - se, ymax = means + se, 
    x = c(1:3)))
g

The second error-bar option we will try is plotting the standard error of the ANOVA model. This means that we want to plot the error bars as a function of the pooled error variance we used to test the differences in the means i.e. the residual variance not explained by the model (be sure to state this in the figure legend!). We can get the pooled error variance from our ANOVA output by remembering that Mean Sq = variance.

Exercise: Try to spot the residual variance of the model in the table below.

summary(aov(iris$Sepal.Length ~ iris$Species))
##               Df Sum Sq Mean Sq F value Pr(>F)    
## iris$Species   2  63.21  31.606   119.3 <2e-16 ***
## Residuals    147  38.96   0.265                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
sampleSizes = aggregate(iris$Species, by = list(Species = iris$Species), 
    FUN = length)

se = sqrt(0.265/sampleSizes$x)
means = aggregate(iris[, 1:4], by = list(Species = iris$Species), 
    FUN = mean)$Sepal.Length
anovaPlot = data.frame(means = means, se = se, species = levels(iris$Species))

g = ggplot(anovaPlot)
g = g + geom_bar(aes(y = means, x = species), stat = "identity")
g = g + geom_errorbar(aes(ymin = means - se, ymax = means + se, 
    x = c(1:3)))
g

The summary command gives a very hypothesis-testing driven output (concentrated on the p-value). It’s equally informative to look at effect sizes (differences of means in measured units) using the summary.lm function. In addition, the summary output tells us that there is AT LEAST ONE PAIR of groups that differ in means, but it doesn’t tell us which pair. To know that we will run a couple more commands (in jargon POST-HOC tests), the first one:

summary.lm(aov(iris$Sepal.Length ~ iris$Species))
## 
## Call:
## aov(formula = iris$Sepal.Length ~ iris$Species)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6880 -0.3285 -0.0060  0.3120  1.3120 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              5.0060     0.0728  68.762  < 2e-16 ***
## iris$Speciesversicolor   0.9300     0.1030   9.033 8.77e-16 ***
## iris$Speciesvirginica    1.5820     0.1030  15.366  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5148 on 147 degrees of freedom
## Multiple R-squared:  0.6187, Adjusted R-squared:  0.6135 
## F-statistic: 119.3 on 2 and 147 DF,  p-value: < 2.2e-16

The output is at first quite cryptic but here’s how to read it in our case:

  • (Intercept) — mean of our first group (setosa) (in units we measured) [to find out the setosa was the first group in our mode run “levels(iris$Species)”]
  • iris$Speciesversicolor — difference of versicolor mean from the setosa mean
  • iris$Speciesvirginica — difference of virginica mean from the setosa mean

All following columns follow logically from this:

  • first row of Std error is the standard error of the mean in setosa
  • second and third rows are the standard error of the differences in means between the respective groups and setosa
  • t value of the first row is the test statistics associated with observing a mean of setosa different from 0
  • t value of the second and third rows are test statistics associated with the differences in respective group means from setosa mean
  • last column gives the p-values associated with the test statistics

R-squared indicates that ~62% of the variability in the data can be accounted by differences in species means. Notice that the last line gives the overall ANOVA test statistics (F), so we wouldn’t actually even need to run the simple summary command in the first place.

Assumptions

We still need to check the assumptions – normally distributed errors and equal variance. The plots are used to check for the following assumptions:

  • Upper left - constancy of variance
  • Lower left - constancy of variance (alternative scale)
  • Upper right - normally distributed errors
  • Lower right - highly influential data points
par(mfrow = c(2, 2))
plot(aov(iris$Sepal.Length ~ iris$Species))

Non-parametric ANOVA

There are many options to perform a non-parametric version of ANOVA. These include the Kruskal Wallis test and the Mann-Whitney U test (aka Wilcoxon rank sum test). You perform a non-parametric ANOVA if your data is not normally distributed.

Exercise: use ?kuskal.test and ?wilcox.test to get familiar with non-parametric options for ANOVA. Another good resource is the Quick-R website: http://www.statmethods.net/stats/nonparametric.html

If non-parametric tests have fewer assumptions than parametric tests, why don’t we use them all the time? Because they are far less powerful when data actually is normally distributed. Non-parametric tests are not affected by outliers because it doesn’t matter how far they lie, but at the same time, extreme (but still normally distributed) values cannot add to the detectable signal. What this means is that you need more data for a similar p-value. As with an ANOVA, the null expectation for both kuskal.test and wilcox.test is that the means are all the same, so a significant result only says that at least one group is different than one other group.

Comparing two quantitative variables

Visualising the data

Often our experiments are such that we are interested in the relationships between two continuous variables. Let’s visualize the relationship between sepal and petal lengths. Both of these variables are continuous, which means that the appropriate statistical tools to analyse their relationship are CORRELATION and REGRESSION.

First let’s plot the two variable using the original data frame and the basic plot command.

par(mfrow = c(1, 3))
plot(iris$Sepal.Length[which(iris$Species == "setosa")], iris$Petal.Length[which(iris$Species == 
    "setosa")])
plot(iris$Sepal.Length[which(iris$Species == "versicolor")], 
    iris$Petal.Length[which(iris$Species == "versicolor")])
plot(iris$Sepal.Length[which(iris$Species == "virginica")], iris$Petal.Length[which(iris$Species == 
    "virginica")])

The basic plot command produces quite nasty looking plots and you would need to specify for example the x- and y-axis labels to make it look nicer. Also note that the x- and y-axis scales are different in all three subplots – this may be misleading and doesn’t produce a very good overall representation of the whole data. Using ggplot helps with these issues, so lets use it to look at the relationship between sepal and petal lengths. In our case it’s most convenient to use the original data frame where sepal and petal lengths are in different columns.

g = ggplot(iris)
g = g + geom_point(aes(Petal.Length, Sepal.Length))
g = g + facet_wrap(~Species)
g

Note how using facet_wrap forces the x- and y-axis scales to remain the same in all subplots. This gives you a better sense of scale of the data and forces good habits. You also save a lot of typing and reduce the possibilities for bugs compared to the basic plot command.

Correlation

Correlation measures the independence vs. dependence of two continuous variables, and the direction and degree to which two variables change together - correlation analysis will tell you if one variable will tend to increase or decrease as the other variable increases or decreases. Remember that correlation analysis does not test causal hypotheses.

To calculate the correlation between petal and sepal lenghts, all we need to do is:

cor.test(iris$Sepal.Length[which(iris$Species == "setosa")], 
    iris$Petal.Length[which(iris$Species == "setosa")])
## 
##  Pearson's product-moment correlation
## 
## data:  iris$Sepal.Length[which(iris$Species == "setosa")] and iris$Petal.Length[which(iris$Species == "setosa")]
## t = 1.9209, df = 48, p-value = 0.0607
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.01206954  0.50776233
## sample estimates:
##       cor 
## 0.2671758

Now I would like to get the p-value from this correlation. How can I visualize the attributes of a given object? Easy! The command is attributes

mycor = cor.test(iris$Sepal.Length[which(iris$Species == "setosa")], 
    iris$Petal.Length[which(iris$Species == "setosa")])

attributes(mycor)
## $names
## [1] "statistic"   "parameter"   "p.value"     "estimate"    "null.value" 
## [6] "alternative" "method"      "data.name"   "conf.int"   
## 
## $class
## [1] "htest"
rval = cor.test(iris$Sepal.Length[which(iris$Species == "setosa")], 
    iris$Petal.Length[which(iris$Species == "setosa")])$p.val

So what does the value of correlation coefficient (rho) mean? Correlation considers the variance in x, the variance in y and their covariance (how the two co-vary). Values of rho always vary between -1 and 1. Values near -1 indicate strong negative correlation, while values near 1 indicate strong positive correlation. When x and y are uncorrelated their covariance (and thus rho) is 0. Rho = 0 indicates a lack of linear association between the two variables. Think of the lack of correlation as a plot of the two variables without any kind of structure – just a random cloud of points. Correlation on the other hand means that there is a structure to that cloud of points – maybe it’s stretched on the diagonal and the could is pointing to upper right, in which case the correlation would be positive.

Note that the default Pearson’s correlation test assumes that x and y are normally distributed. There’s also the non-parametric version of Spearman’s rank correlation (in case of non-normality).

cor.test(iris$Sepal.Length[which(iris$Species == "setosa")], 
    iris$Petal.Length[which(iris$Species == "setosa")], method = "spearman")
## Warning in cor.test.default(iris$Sepal.Length[which(iris$Species ==
## "setosa")], : Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  iris$Sepal.Length[which(iris$Species == "setosa")] and iris$Petal.Length[which(iris$Species == "setosa")]
## S = 15017, p-value = 0.04985
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.2788849

Exercise: Explore the correlation of sepal and petal lengths in each three species of the iris dataset. Which species shows the most and which shows the least level of correlation? How come can we use Spearman’s correlation test for non-normal data?

Regression

The essence of regression is to define a statistical model that describes the relationship between the response variable and the explanatory variable(s). You perform regression when you want to explain variation in the response variable (y) with variation in the explanatory variable (x). The simplest model of all is the linear model.

y = a + bx

This means that y is a product of a baseline value (a) and the value of x times a constant b.

We therefore need to estimate at least two parameters:

Exercise: Discuss the differences of correlation and regression.

The correlation between petal and sepal lengths in setosa-species was not that good, so let’s use the same variables in the virginica-species instead for our exercises. As usual, we begin our analysis by visualising the data.

par(mfrow = c(1, 1))
plot(iris$Sepal.Length[which(iris$Species == "virginica")], iris$Petal.Length[which(iris$Species == 
    "virginica")], pch = 20)

Simplest linear model

We want to explain variation in petal length by variation in sepal length. We do this by defining a linear model. To estimate (the simplest) linear model between petal and sepal lengths we use the lm function.

fit = lm(iris$Petal.Length[which(iris$Species == "virginica")] ~ 
    iris$Sepal.Length[which(iris$Species == "virginica")])
fit
## 
## Call:
## lm(formula = iris$Petal.Length[which(iris$Species == "virginica")] ~ 
##     iris$Sepal.Length[which(iris$Species == "virginica")])
## 
## Coefficients:
##                                           (Intercept)  
##                                                0.6105  
## iris$Sepal.Length[which(iris$Species == "virginica")]  
##                                                0.7501

Exercise: Read aloud the lm function call (with your own words) we just executed.

Now let’s add the estimated regression model to the plot.

par(mfrow = c(1, 1))
plot(iris$Sepal.Length[which(iris$Species == "virginica")], iris$Petal.Length[which(iris$Species == 
    "virginica")], pch = 20)
fit = lm(iris$Petal.Length[which(iris$Species == "virginica")] ~ 
    iris$Sepal.Length[which(iris$Species == "virginica")])
fit
## 
## Call:
## lm(formula = iris$Petal.Length[which(iris$Species == "virginica")] ~ 
##     iris$Sepal.Length[which(iris$Species == "virginica")])
## 
## Coefficients:
##                                           (Intercept)  
##                                                0.6105  
## iris$Sepal.Length[which(iris$Species == "virginica")]  
##                                                0.7501
abline(fit, col = "green")

Model testing

Now we are ready to interpret the results of our very first regression analysis.

summary(fit)
## 
## Call:
## lm(formula = iris$Petal.Length[which(iris$Species == "virginica")] ~ 
##     iris$Sepal.Length[which(iris$Species == "virginica")])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.68603 -0.21104  0.06399  0.18901  0.66402 
## 
## Coefficients:
##                                                       Estimate Std. Error
## (Intercept)                                            0.61047    0.41711
## iris$Sepal.Length[which(iris$Species == "virginica")]  0.75008    0.06303
##                                                       t value Pr(>|t|)    
## (Intercept)                                             1.464     0.15    
## iris$Sepal.Length[which(iris$Species == "virginica")]  11.901  6.3e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2805 on 48 degrees of freedom
## Multiple R-squared:  0.7469, Adjusted R-squared:  0.7416 
## F-statistic: 141.6 on 1 and 48 DF,  p-value: 6.298e-16

You can also have a look at the sum of squares assigned to the model and the residuals so:

summary.aov(fit)
##                                                       Df Sum Sq Mean Sq
## iris$Sepal.Length[which(iris$Species == "virginica")]  1 11.147  11.147
## Residuals                                             48  3.778   0.079
##                                                       F value  Pr(>F)    
## iris$Sepal.Length[which(iris$Species == "virginica")]   141.6 6.3e-16 ***
## Residuals                                                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

To understand the aov table you need to know the following:

  • “Df” == Degrees of Freedom
  • “Mean sq” == variance
  • “F value” == variance(model) / variance(residuals) == explained variance / unexplained variance == signal-to-noise ratio
  • “Pr(>F)” == the probability of observing an F value as large or larger by change given that the null hypothesis was true (the p-value)

Exercise: Interpret the result of our regression analysis. Describe the results to the person sitting next to you (whats our conclusion, which parameters are significative and which are not, how good do the parameter estimates look like).

Fit of the model

An important measure to consider is the degree of fit of the model. For this exercise, let’s add some noise into the petal lengths so that the spread of the data is larger.

noisyData = iris[which(iris$Species == "virginica"), ]  # initiate the data frame with virginica data

noisyData$Petal.Length = noisyData$Petal.Length + rnorm(length(noisyData$Petal.Length), 
    sd = 0.5)  # add random numbers to petal length drawn from a normal distribution with a mean of 0 and sd=.5

Compare the regression models for the two datasets:

fit1 = lm(iris$Petal.Length[which(iris$Species == "virginica")] ~ 
    iris$Sepal.Length[which(iris$Species == "virginica")])

fit2 = lm(noisyData$Petal.Length ~ noisyData$Sepal.Length)

par(mfrow = c(1, 2))
plot(iris$Petal.Length[which(iris$Species == "virginica")] ~ 
    iris$Sepal.Length[which(iris$Species == "virginica")], pch = 20, 
    xlim = c(5, 8), ylim = c(3, 8), ylab = "Petal length", xlab = "Sepal length", 
    main = "original")
abline(fit1, col = "green")
plot(noisyData$Petal.Length ~ noisyData$Sepal.Length, pch = 20, 
    xlim = c(5, 8), ylim = c(3, 8), ylab = "Petal length", xlab = "Sepal length", 
    main = "with noise")
abline(fit2, col = "green")

Compare the R-squared terms of the two models. The R-squared tells you how much of variation in y is explained by variation in x.

summary(fit1)
## 
## Call:
## lm(formula = iris$Petal.Length[which(iris$Species == "virginica")] ~ 
##     iris$Sepal.Length[which(iris$Species == "virginica")])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.68603 -0.21104  0.06399  0.18901  0.66402 
## 
## Coefficients:
##                                                       Estimate Std. Error
## (Intercept)                                            0.61047    0.41711
## iris$Sepal.Length[which(iris$Species == "virginica")]  0.75008    0.06303
##                                                       t value Pr(>|t|)    
## (Intercept)                                             1.464     0.15    
## iris$Sepal.Length[which(iris$Species == "virginica")]  11.901  6.3e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2805 on 48 degrees of freedom
## Multiple R-squared:  0.7469, Adjusted R-squared:  0.7416 
## F-statistic: 141.6 on 1 and 48 DF,  p-value: 6.298e-16
summary(fit2)
## 
## Call:
## lm(formula = noisyData$Petal.Length ~ noisyData$Sepal.Length)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.27728 -0.38847  0.09842  0.49454  1.14463 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              0.1914     0.8721   0.219    0.827    
## noisyData$Sepal.Length   0.8161     0.1318   6.193 1.27e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5865 on 48 degrees of freedom
## Multiple R-squared:  0.4442, Adjusted R-squared:  0.4326 
## F-statistic: 38.36 on 1 and 48 DF,  p-value: 1.269e-07

Assumptions

Regression assumes constancy of variance and normality of errors. Plotting the linear model allows visual inspection of these assumptions.

par(mfrow = c(2, 2))
plot(fit)

The top row is the most important. You want the residuals vs fitted values to be a random cloud of points without any tendency to increase or decrease. The q-q plot should be a straight line if the model residuals are normally distributed. The scale-location plot is like the first plot but on another scale - you don’t want the spread on y-axis to change as a function of x-values. The residual vs leverage highlights the influence of single datapoints on parameter estimates - ideally you want all the points inside the dashed lines.

Principal Component Analysis

Principal component analysis (PCA) is a statistical procedure commonly used to analyze major trends in multivariate data. Put simplistically, PCA clusters your samples by similarity and provides information on the variables responsible for this clustering. Below we will work through a PCA analysis with the Iris data, to determine the similarities in the morphological profiles of different plant species.

# Load data
data(iris)

First we log transform the data. Note that with the iris dataset this is not necessary as the data is not very skewed, but the log-transformation does not hurt…

log.ir <- log(iris[, 1:4])
ir.species <- iris[, 5]

# apply PCA - scale. = TRUE is highly advisable, but default
# is FALSE.
ir.pca <- prcomp(log.ir, center = TRUE, scale. = TRUE)

# print method
print(ir.pca)
## Standard deviations:
## [1] 1.7124583 0.9523797 0.3647029 0.1656840
## 
## Rotation:
##                     PC1         PC2        PC3         PC4
## Sepal.Length  0.5038236 -0.45499872  0.7088547  0.19147575
## Sepal.Width  -0.3023682 -0.88914419 -0.3311628 -0.09125405
## Petal.Length  0.5767881 -0.03378802 -0.2192793 -0.78618732
## Petal.Width   0.5674952 -0.03545628 -0.5829003  0.58044745
# plot method
plot(ir.pca, type = "l")

# summary method will indicate the variance explained by each
# of the components.
summary(ir.pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.7125 0.9524 0.36470 0.16568
## Proportion of Variance 0.7331 0.2268 0.03325 0.00686
## Cumulative Proportion  0.7331 0.9599 0.99314 1.00000

now we want a visual representation of our PCA plot

## Loading required package: plyr
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:reshape':
## 
##     rename, round_any
## Loading required package: scales